A unified approach to statistical language modeling for Chinese
Authors
Abstract
This paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigrams to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the web, creates a high-quality lexicon, and segments the training data using this lexicon, all using a maximum likelihood principle, which is consistent with the trigram training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
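The segmentation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a unigram word model in place of the trigram model, and the lexicon entries, their probabilities, the function name `segment`, and the `max_word_len` parameter are all hypothetical, chosen only to show how a maximum likelihood criterion selects among competing word boundaries.

```python
import math


def segment(text, word_logprob, max_word_len=4):
    """Return the most likely split of `text` into lexicon words.

    Assumption: a unigram model, so the score of a segmentation is the
    sum of the log-probabilities of its words. Dynamic programming over
    end positions finds the highest-scoring split.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: top score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word not in word_logprob:
                continue
            score = best[start] + word_logprob[word]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")
    # Follow the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical toy lexicon with made-up probabilities (illustration only).
lexicon = {"北京": 0.2, "大学": 0.2, "北京大学": 0.3,
           "生": 0.1, "学生": 0.15, "大": 0.05}
word_logprob = {w: math.log(p) for w, p in lexicon.items()}

print(segment("北京大学生", word_logprob))
```

With these toy probabilities, the split that treats 北京大学 as a single word scores higher than the splits that break it apart, which shows how the lexicon and the likelihood criterion together determine word boundaries.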
Similar resources
Unified Statistical Modeling for Circuit Simulation
Accurate statistical simulation and modeling are important for IC design. Different types of statistical simulation require different types of statistical models. In this paper a unified approach to statistical modeling and characterization is presented. Based on physical process parameters and propagation of variance, it allows modeling of process extremes, distributional modeling for Monte Ca...
Modeling a semantic recommender system for medical prescriptions and drug interaction detection
Introduction: The administration of appropriate drugs to patients is one of the most important processes of treatment and requires careful decision-making based on the current condition of the patient and their history and symptoms. In many cases, patients may require more than one drug, or in addition to having a previous illness and receiving the drug, they need new drugs for the new illness, ...
Text classification in Asian languages without word segmentation
We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has strong information...
A Model of Iranian EFL Learners' Cultural Identity: A Structural Equation Modeling Approach
This study aimed, firstly, to investigate the underlying components of Iranian cultural identity and, secondly, to confirm the aforementioned components via Structural Equation Modeling (SEM) analysis. In order to achieve these goals, the researchers reviewed the extensive local and international literature on language, culture and identity. Based on the literature and consultations with a grou...
Cultural Differences Encountered by a Novice Chinese Immersion Teacher in an American Kindergarten Immersion Classroom
The research objective of this study was to explore the cultural differences and challenges encountered by the Chinese Immersion Teacher (CIT) and how the CIT deals with the cultural differences in the immersion classroom. A qualitative case study approach was chosen for this research. The participant was a novice kindergarten immersion teacher who was born and educated in a Chinese-speaking cou...
Publication date: 2000